Trump vs Biden 2020 Presidential Campaign Twitter Sentiment
Introduction
One of the hottest topics of 2020 was the presidential election between Donald J. Trump and former vice president Joe Biden. This election was full of firsts: it was the first election held during a worldwide pandemic, the first with three states whose margin of victory was under 1%, and the first in which the incumbent president did not concede. Because of the election's uniqueness and the sharply contrasting personalities of the candidates, I decided to analyze the sentiment of each candidate's tweets to find out how much their social media attitude affected their Twitter impressions and their general approval.
In the analysis, we map each candidate's sentiment as a time series and measure how popular their most-liked and least-liked tweets were through retweets and likes. We wanted to see whether there was a correlation between negative or positive sentiment and popularity for each candidate.
We drew the motivation for this topic from an article published by Cambridge University Press titled "Differences in negativity bias underlie variations in political ideology". The article discusses how negative thoughts gain more attention and popularity than positive ones, and how they stay in memory longer. This negativity bias extends to politics, which sparked the idea of applying sentiment analysis there. The 2020 presidential election was the perfect place to focus an analysis of negative political bias.
import tweepy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from textblob import TextBlob
import datetime
import re
from nltk.tokenize import word_tokenize
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import seaborn as sns
[nltk_data] Downloading package stopwords to [nltk_data] C:\Users\lalin\AppData\Roaming\nltk_data... [nltk_data] Package stopwords is already up-to-date! [nltk_data] Downloading package punkt to [nltk_data] C:\Users\lalin\AppData\Roaming\nltk_data... [nltk_data] Package punkt is already up-to-date!
We will get the most recent Trump and Biden tweets by querying the Twitter API, covering August 2020 through November 2, 2020.
The script below pulls the most recent tweets from a Twitter account (nine pages of 200, so up to 1,800) and stores them as a pkl file. We only run it occasionally to stay within Twitter's API rate limits.
# don't run these (live API credentials redacted; substitute your own keys from the Twitter developer portal)
# AccessToken = 'YOUR_ACCESS_TOKEN'
# AccessTokenSecret = 'YOUR_ACCESS_TOKEN_SECRET'
# API_key = 'YOUR_API_KEY'
# API_key_secret = 'YOUR_API_KEY_SECRET'
# auth = tweepy.OAuthHandler(API_key, API_key_secret)
# auth.set_access_token(AccessToken, AccessTokenSecret)
# api = tweepy.API(auth)
# user = api.me()
# print(user.name)
#tweets = []
#for page in range(1, 10):
#    tweets.extend(api.user_timeline(screen_name="realDonaldTrump", count=200, page=page))
#print("Number of tweets extracted: {}.".format(len(tweets)))
#with open('trump_tweets.pkl', 'wb') as f:
#    pickle.dump(tweets, f)
The pkl files will be named trump_tweets.pkl and biden_tweets.pkl, respectively.
We will import the historical Twitter data for Trump and Biden from the Kaggle repositories (up to 8/30/2020).
We renamed the CSV files to TrumpOld.csv and BidenOld.csv respectively.
trumpOld = pd.read_csv("./Data/TrumpOld.csv")
bidenOld = pd.read_csv("./Data/BidenOld.csv")
First we'll print out 5 tweets to evaluate their format
with open('./Data/trump_tweets.pkl', 'rb') as file:
    ttweets = pickle.load(file)
for tweet in ttweets[:5]:
    print(tweet.text)
These states in question should immediately be put in the Trump Win column. Biden did not win, he lost by a lot!… https://t.co/w7y0zDaYdL Big legal win in Pennsylvania! RT @realDonaldTrump: STOP THE COUNT! Jobless Claims Dip to 751,000, Lowest Since March https://t.co/dzuJpS78nb via @BreitbartNews Fmr NV AG Laxalt: ‘No Question‘ Trump Would Have Won Nevada ‘Convincingly‘ Without Mail-in Voting https://t.co/pm4Wpfr6x0 via @BreitbartNews
with open('./Data/biden_tweets.pkl', 'rb') as file:
    btweets = pickle.load(file)
for tweet in btweets[:5]:
    print(tweet.text)
I extend my deep condolences to the loved ones of the peacekeepers, including 6 American service members, who died… https://t.co/h5ZF41fR9C RT @Transition46: President-elect Biden spoke this morning with His Holiness Pope Francis. https://t.co/om635SC3M9 https://t.co/DYuiiphOE0 Because of the Affordable Care Act: - People with pre-existing conditions are protected - More than 20 million Ame… https://t.co/GglK0KyJe7 Ron Klain’s deep, varied experience and capacity to work with people all across the political spectrum is precisely… https://t.co/KOx0BvNlae This Veterans Day, I feel the full weight of the honor and the responsibility that has been entrusted to me by the… https://t.co/VjsNzut0R3
It seems that, based on the first 5 tweets for each candidate, there are links, retweet identifiers (RT), and punctuation that we'll have to clean before applying any NLP or sentiment analysis.
We'll also examine the "keys" for each tweet, which are the variables associated with the tweet object.
We can use Twitter's official developer documentation to examine what each variable means:
https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet
ttweets[0].__dict__.keys()
dict_keys(['_json', 'created_at', 'id', 'id_str', 'text', 'truncated', 'entities', 'source', 'source_url', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'author', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'quoted_status_id', 'quoted_status_id_str', 'quoted_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive', 'lang'])
The data obtained from Twitter's API contains an attribute named retweeted_status which, according to the documentation, is present only when a tweet is a retweet. We will use this to filter retweets out of the pkl data.
town_tweets = [tweet for tweet in ttweets if not hasattr(tweet, 'retweeted_status')]
town_tweets[0].text
'These states in question should immediately be put in the Trump Win column. Biden did not win, he lost by a lot!… https://t.co/w7y0zDaYdL'
bown_tweets = [tweet for tweet in btweets if not hasattr(tweet, 'retweeted_status')]
bown_tweets[0].text
'I extend my deep condolences to the loved ones of the peacekeepers, including 6 American service members, who died… https://t.co/h5ZF41fR9C'
pkl files
We'll make dataframes for Trump and Biden from the pkl files.
dfTrump = pd.DataFrame(data = [[tweet.id, tweet.created_at, tweet.text, tweet.favorite_count, tweet.retweet_count] for tweet in town_tweets],
columns= ['id', 'date', 'tweet', 'likes', 'retweets'])
dfTrump = dfTrump[dfTrump.date <= pd.to_datetime('2020-11-02', infer_datetime_format=True)]
print(dfTrump.shape)
(1305, 5)
dfBiden = pd.DataFrame(data = [[tweet.id, tweet.created_at, tweet.text, tweet.favorite_count, tweet.retweet_count] for tweet in bown_tweets],
columns= ['id', 'date', 'tweet', 'likes', 'retweets'])
dfBiden = dfBiden[dfBiden.date <= pd.to_datetime('2020-11-02', infer_datetime_format=True)]
print(dfBiden.shape)
(1492, 5)
pkl and historical csv files
We will merge the historical tweet data with the recently pulled tweet data and make sure there are no duplicates.
We will also need to remove retweets, adjust the column order and names, convert the date column to datetime, and only include data from 4/25/2019 onwards, the date Joe Biden announced his presidential candidacy.
#Remove retweets using isRetweet column
trumpOld = trumpOld.loc[trumpOld.isRetweet == "f" , ["id", "date", "text", "favorites", "retweets"]].copy()
#prepare columns for joining
trumpOld.columns = ["id", "date", 'tweet', 'likes', 'retweets']
#refactor date column to datetime dtype
trumpOld.date = pd.to_datetime(trumpOld.date, infer_datetime_format=True)
#Only include data after April 25, 2019
trumpOld = trumpOld[trumpOld.date >=pd.to_datetime('2019-04-25', infer_datetime_format=True)]
print(trumpOld.shape)
(7484, 5)
#prepare columns for joining
bidenOld = bidenOld.loc[:, ["id", "timestamp", "tweet", "likes", "retweets"]].copy()
#change column names
bidenOld.columns = ["id", "date", 'tweet', 'likes', 'retweets']
#refactor date column to datetime dtype
bidenOld.date = pd.to_datetime(bidenOld.date, infer_datetime_format=True)
#Only include data after April 25, 2019
bidenOld = bidenOld[bidenOld.date >=pd.to_datetime('2019-04-25', infer_datetime_format=True)]
print(bidenOld.shape)
(3264, 5)
Finally we will join the datasets together.
TrumpFinal = pd.concat([trumpOld, dfTrump])  # DataFrame.append is deprecated in newer pandas
print(TrumpFinal.shape)
(8789, 5)
BidenFinal = pd.concat([bidenOld, dfBiden])
print(BidenFinal.shape)
(4756, 5)
TrumpFinal = TrumpFinal.drop_duplicates(subset=['tweet']).copy()
BidenFinal = BidenFinal.drop_duplicates(subset=['tweet']).copy()
print(TrumpFinal.shape)
print(BidenFinal.shape)
(8617, 5) (4755, 5)
Comparing the shapes before and after, Trump had 172 duplicate tweets while Biden had just 1.
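If we want to eyeball which tweets were duplicated, pandas' duplicated() can flag every copy; a quick sketch (re-building the concatenated frame, since TrumpFinal is already deduplicated at this point):
dup_check = pd.concat([trumpOld, dfTrump])
# keep=False marks all copies of a duplicated tweet, not just the later ones
print(dup_check[dup_check.duplicated(subset=['tweet'], keep=False)].tweet.head())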
We use regex to remove words and symbols that don't carry meaning in speech, such as '@' mentions, retweet markers (RT), hashtag symbols (#), website links, and leftover punctuation.
def cleaner(txt):
    #remove '@' mentions
    txt = re.sub(r'@[A-Za-z0-9]+', '', txt)
    #remove hashtag symbols (#)
    txt = re.sub(r'#', '', txt)
    #remove retweet markers (RT)
    txt = re.sub(r'RT[\s]+', '', txt)
    #remove website links
    txt = re.sub(r'https?:\/\/\S+', '', txt)
    #remove any other symbol
    txt = re.sub(r'[^\w]', ' ', txt)
    #remove punctuation
    txt = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', txt)
    #strip 'amp' left over from HTML-escaped ampersands (&amp;); note this is crude and also hits words containing 'amp'
    txt = re.sub(r'amp', '', txt)
    return txt
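As a quick sanity check, here is cleaner() applied to a made-up raw tweet that combines the patterns seen in the samples above (RT marker, mention, link, hashtag):
sample = "RT @realDonaldTrump: Big legal win in Pennsylvania! https://t.co/w7y0zDaYdL #MAGA"
print(cleaner(sample))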
TrumpFinal['tweet'] = TrumpFinal['tweet'].apply(cleaner)
TrumpFinal = TrumpFinal[(TrumpFinal['tweet'] != "") & (TrumpFinal['tweet'] != " ")].copy()
print(TrumpFinal.shape)
BidenFinal['tweet'] = BidenFinal['tweet'].apply(cleaner)
BidenFinal = BidenFinal[(BidenFinal['tweet'] != "") & (BidenFinal['tweet'] != " ")].copy()
print(BidenFinal.shape)
(7732, 5) (4749, 5)
A large portion of Trump's tweets are too short and vague to include in the NLP analysis; many of them just thank people or states, often with the hashtag (#) symbol, which has been removed. Because of tweets like these, we will remove tweets that are 5 or fewer words long.
TrumpFinal[TrumpFinal.tweet.str.contains("THANK YOU")].tail()
| | id | date | tweet | likes | retweets |
|---|---|---|---|---|---|
| 1185 | 1.303835e+18 | 2020-09-09 23:18:46 | These are my real words about our GREAT HEROES... | 90033 | 33326 |
| 1320 | 1.301956e+18 | 2020-09-04 18:49:55 | THANK YOU MAGA | 113728 | 24547 |
| 1321 | 1.301937e+18 | 2020-09-04 17:37:20 | A GREAT HONOR THANK YOU | 86395 | 23003 |
| 1335 | 1.301694e+18 | 2020-09-04 01:31:46 | THANK YOU LATROBE PENNSYLVANIA MAGA | 70475 | 15963 |
| 1359 | 1.301249e+18 | 2020-09-02 20:03:06 | THANK YOU NORTH CAROLINA | 83046 | 19894 |
TrumpFinal = TrumpFinal[TrumpFinal.tweet.str.split().str.len().gt(5)].copy()
BidenFinal = BidenFinal[BidenFinal.tweet.str.split().str.len().gt(5)].copy()
print(TrumpFinal.shape)
print(BidenFinal.shape)
(6695, 5) (4617, 5)
Tweets that are outliers in terms of likes will be removed: specifically, any tweet whose like count is more than two standard deviations from the mean.
def removeOutliers(df):
    mean = np.mean(df.likes)
    sd = np.std(df.likes)
    # keep tweets within two standard deviations of the mean like count
    df = df[df.likes > (mean - 2 * sd)].copy()
    df = df[df.likes < (mean + 2 * sd)].copy()
    return df
TrumpFinal = removeOutliers(TrumpFinal)
BidenFinal = removeOutliers(BidenFinal)
print(TrumpFinal.shape)
print(BidenFinal.shape)
(6444, 5) (4471, 5)
We will use the TextBlob library to assign sentiment and subjectivity to tweets.
a = TextBlob("Wear your mask")
b = TextBlob("Have a good day")
c = TextBlob("I hate you")
print("Sentence: " + str(a))
print("Sentiment: " + str(a.sentiment.polarity))
print("Sentence: " + str(b))
print("Sentiment: " + str(b.sentiment.polarity))
print("Sentence: " + str(c))
print("Sentiment: " + str(c.sentiment.polarity))
Sentence: Wear your mask Sentiment: 0.0 Sentence: Have a good day Sentiment: 0.7 Sentence: I hate you Sentiment: -0.8
Sentiment (polarity) measures how positive or negative a sentence is, from -1 to 1.
Subjectivity measures how opinionated a sentence is, from 0 (objective, factual) to 1 (subjective, opinion).
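To illustrate the subjectivity end of the scale the same way (a quick check with toy sentences of our own):
# opinionated wording should score near 1, plain factual wording near 0
print(TextBlob("I hate you").sentiment.subjectivity)
print(TextBlob("Jobless claims fell this week").sentiment.subjectivity)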
TrumpFinal['sentiment'] = TrumpFinal['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.polarity)
TrumpFinal['subjectivity'] = TrumpFinal['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.subjectivity)
print(TrumpFinal.shape)
(6444, 7)
BidenFinal['sentiment'] = BidenFinal['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.polarity)
BidenFinal['subjectivity'] = BidenFinal['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.subjectivity)
print(BidenFinal.shape)
(4471, 7)
We add a -1/0/1 numeric label (and a matching Negative/Neutral/Positive string label) to each tweet, based on the sign of its sentiment score.
TrumpFinal['label'] = 1
TrumpFinal.loc[TrumpFinal['sentiment'] < 0, ['label']] = -1
TrumpFinal.loc[TrumpFinal['sentiment'] == 0, ['label']] = 0
BidenFinal['label'] = 1
BidenFinal.loc[BidenFinal['sentiment'] < 0, ['label']] = -1
BidenFinal.loc[BidenFinal['sentiment'] == 0, ['label']] = 0
TrumpFinal['label_s'] = "Positive"
TrumpFinal.loc[TrumpFinal['sentiment'] < 0, ['label_s']] = 'Negative'
TrumpFinal.loc[TrumpFinal['sentiment'] == 0, ['label_s']] = 'Neutral'
BidenFinal['label_s'] = "Positive"
BidenFinal.loc[BidenFinal['sentiment'] < 0, ['label_s']] = 'Negative'
BidenFinal.loc[BidenFinal['sentiment'] == 0, ['label_s']] = 'Neutral'
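An equivalent, more compact way to build both label columns is np.sign plus a dict mapping; a sketch that produces the same values as the assignments above (and likewise for BidenFinal):
# numeric label from the sign of the sentiment, then a string version via a mapping
TrumpFinal['label'] = np.sign(TrumpFinal['sentiment']).astype(int)
TrumpFinal['label_s'] = TrumpFinal['label'].map({-1: 'Negative', 0: 'Neutral', 1: 'Positive'})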
#TrumpFinal.to_csv("TrumpFinal.csv", index = False)
First I'll take a look at the average likes per week for each candidate
trumpLikes = TrumpFinal.groupby([pd.Grouper(key='date', freq='W-MON')]).mean().reset_index()
bidenLikes = BidenFinal.groupby([pd.Grouper(key='date', freq='W-MON')]).mean().reset_index()
fig = go.Figure()
fig.add_trace(go.Scatter(x=bidenLikes.date, y=bidenLikes.likes,
mode='lines',
name='Biden'))
fig.add_trace(go.Scatter(x=trumpLikes.date, y=trumpLikes.likes,
mode='lines',
name='Trump'))
fig.update_layout(title='Average Weekly Tweet Likes',
xaxis_title='Date (by week)',
yaxis_title='Likes')
fig.show()
As we can see, Trump gets far more likes, largely because he was a much better-known figure than Biden at the start of this window. Around June 2020, Biden gets a large jump in popularity, putting him on par with Trump in Twitter likes.
We can assume this is because Trump had four years in office to become famous, or infamous, while Biden needed a year of campaigning to catch up. From here on, we will only consider tweets from June 2020 onwards.
trumpSent = TrumpFinal[TrumpFinal.date > pd.to_datetime('2020-06-01', infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean().reset_index()
bidenSent = BidenFinal[BidenFinal.date > pd.to_datetime('2020-06-01', infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean().reset_index()
fig = go.Figure()
fig.add_trace(go.Scatter(x=bidenSent.date, y=bidenSent.label,
                         mode='lines',
                         name='Biden'))
fig.add_trace(go.Scatter(x=trumpSent.date, y=trumpSent.label,
                         mode='lines',
                         name='Trump'))
fig.add_shape(type='line',
x0=min(trumpSent.date),
y0=0,
x1=max(trumpSent.date),
y1=0,
line=dict(color='#00CC96',),
xref='x',
yref='y'
)
fig.update_layout(title='Daily Sentiment Score',
xaxis_title='Date',
yaxis_title='Sentiment')
fig.show()
As we can see from the chart above, both candidates' tweets are similar in sentiment between June 2020 and November 2, 2020.
However, Trump is noticeably more extreme in both his positive and negative sentiment, while Biden stays slightly more neutral.
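One way to quantify "more extreme" is to compare the spread of the two daily sentiment series (a quick check using the daily means computed above):
# standard deviation of daily mean sentiment; a larger value means wider swings
print("Trump: " + str(trumpSent.label.std()))
print("Biden: " + str(bidenSent.label.std()))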
trumpLikes = TrumpFinal[TrumpFinal.date > pd.to_datetime('2020-06-01', infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean().reset_index()
bidenLikes = BidenFinal[BidenFinal.date > pd.to_datetime('2020-06-01', infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean().reset_index()
trumpSent = TrumpFinal[TrumpFinal.date > pd.to_datetime('2020-06-01', infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean().reset_index()
bidenSent = BidenFinal[BidenFinal.date > pd.to_datetime('2020-06-01', infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean().reset_index()
bidenLikes['vars'] = bidenLikes.likes.values - np.median(bidenLikes.likes)
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scatter(x=bidenSent.date, y=bidenSent.label,
mode='lines',
name='Sentiment',
marker_color='#3283FE'),
secondary_y=True);
fig.add_trace(
go.Bar(
x=bidenLikes.date,
y=bidenLikes.vars,
name='Likes',
marker_color='#FEAF16'),
secondary_y=False);
fig.add_shape(type='line',
x0=min(bidenSent.date),
y0=0,
x1=max(bidenSent.date),
y1=0,
line=dict(color='#00CC96',),
xref='x',
yref='y'
)
fig.update_layout(title='<b>Biden</b> Sentiment and Mean Tweet Likes Difference by Day',
xaxis_title='Date (by day)',
yaxis_title='Likes')
fig.update_yaxes(title_text="<b>Tweet Likes Difference</b>", range=[-40000, 100000],color='#FEAF16', secondary_y=False)
fig.update_yaxes(title_text="<b>Sentiment Score</b>", range=[-0.4, 1],color='#3283FE', secondary_y=True)
fig.show()
Based on the chart above, we can see there is little to no correlation between the daily likes disparity and sentiment for Biden.
BidenLatest = BidenFinal[BidenFinal.date > pd.to_datetime('2020-06-01', infer_datetime_format=True)]
np.corrcoef(BidenLatest.likes, BidenLatest.sentiment)
array([[ 1. , -0.03836922],
[-0.03836922, 1. ]])
The correlation coefficient between tweet likes and sentiment (about -0.04) is essentially zero.
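As a follow-up, scipy's pearsonr returns a p-value alongside the coefficient, which helps confirm the relationship is indistinguishable from noise (scipy is an extra dependency, though it ships alongside the sklearn stack):
from scipy.stats import pearsonr
r, p = pearsonr(BidenLatest.likes, BidenLatest.sentiment)
print("r = " + str(r) + ", p = " + str(p))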
print(px.colors.qualitative.Plotly)
['#636EFA', '#EF553B', '#00CC96', '#AB63FA', '#FFA15A', '#19D3F3', '#FF6692', '#B6E880', '#FF97FF', '#FECB52']
trumpLikes = TrumpFinal[TrumpFinal.date > pd.to_datetime('2020-06-01', infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean().reset_index()
bidenLikes = BidenFinal[BidenFinal.date > pd.to_datetime('2020-06-01', infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean().reset_index()
trumpSent = TrumpFinal[TrumpFinal.date > pd.to_datetime('2020-06-01', infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean().reset_index()
bidenSent = BidenFinal[BidenFinal.date > pd.to_datetime('2020-06-01', infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean().reset_index()
trumpLikes['vars'] = trumpLikes.likes.values - np.median(trumpLikes.likes.dropna())
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scatter(x=trumpSent.date, y=trumpSent.label,
mode='lines',
name='Sentiment',
marker_color='#636EFA'),
secondary_y=True);
fig.add_trace(
go.Bar(
x=trumpLikes.date,
y=trumpLikes.vars,
marker_color='#EF553B',
name='Likes'),
secondary_y=False);
fig.update_layout(title='<b>Trump</b> Sentiment and Mean Tweet Likes Difference by Day',
xaxis_title='Date (by day)',
yaxis_title='Likes',)
fig.add_shape(type='line',
              x0=min(trumpSent.date),
              y0=0,
              x1=max(trumpSent.date),
              y1=0,
              line=dict(color='#00CC96',),
              xref='x',
              yref='y'
)
fig.update_yaxes(title_text="<b>Tweet Likes Difference</b>", range=[-70000, 110000],color='#EF553B', secondary_y=False)
fig.update_yaxes(title_text="<b>Sentiment Score</b>", range=[-0.7, 1.1],color='#636EFA', secondary_y=True)
fig.show()
Based on the chart above, we can see slightly more of a relationship for Trump: more negative sentiment tends to come with more likes.
TrumpLatest = TrumpFinal[TrumpFinal.date > pd.to_datetime('2020-06-01', infer_datetime_format=True)]
np.corrcoef(TrumpLatest.likes, TrumpLatest.sentiment)
array([[ 1. , -0.10669],
[-0.10669, 1. ]])
The correlation coefficient of -0.11 confirms a slight negative relationship, meaning that as Trump's tweets get more negative in sentiment, they receive somewhat more likes.
fig = px.scatter(TrumpFinal, x='sentiment', y='subjectivity', color = 'label_s')
fig.show()
From this plot, we can see that the more subjective a tweet is, the stronger its sentiment tends to be, in either direction.
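We can put a number on that visual impression by correlating the absolute sentiment with subjectivity (a quick check, not part of the original pipeline):
# strong sentiment in either direction should track with higher subjectivity
np.corrcoef(TrumpFinal.sentiment.abs(), TrumpFinal.subjectivity)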
Since TextBlob does not capture Trump's speech patterns well, we decided to try training our own classifier to identify positive and negative Trump tweets. Our team manually labeled around 800 Trump tweets from the latter half of 2020 to train it.
trumpnlp = pd.read_csv("./Data/TrumpNLP.csv")
trumpnlp.shape
(7979, 6)
We will use a logistic regression classifier, so we remove neutral tweets (those with sentiment 0). We also drop all NAs, which removes the tweets we did not manually label.
trumpnlp=trumpnlp[trumpnlp.sentiment != 0].dropna().reset_index(drop=True).copy()
trumpnlp = trumpnlp.drop_duplicates(subset=['tweet']).copy()
print(trumpnlp.shape)
trumpnlp.head()
(470, 6)
| | id | date | tweet | likes | retweets | sentiment |
|---|---|---|---|---|---|---|
| 0 | 1.270000e+18 | 6/1/2020 12:27 | “These were the people that trashed Seattle ye... | 140518 | 38591 | -1.0 |
| 1 | 1.270000e+18 | 6/1/2020 12:44 | Sleep Joe Biden’s people are so Radical Left t... | 38147 | 10089 | -1.0 |
| 2 | 1.270000e+18 | 6/2/2020 13:24 | Congressman Andy Harris (@Harris4Congress) is ... | 30958 | 7964 | 1.0 |
| 3 | 1.270000e+18 | 6/2/2020 13:25 | Congresswoman Jackie Walorski (@jackiewalorski... | 67527 | 18135 | 1.0 |
| 4 | 1.270000e+18 | 6/2/2020 13:25 | Congressman Jim Banks (@jim_banks) is a fighte... | 32484 | 8395 | 1.0 |
Tweets that are 5 or fewer words long will again be removed, as they usually don't contain enough text for a reliable sentiment label.
trumpnlp = trumpnlp[trumpnlp.tweet.str.split().str.len().gt(5)].copy()
trumpnlp.shape
(456, 6)
The cleanerNLP function will perform several actions on the tweet data to prepare it for the classifier.
def cleanerNLP(txt):
    #REMOVE SPECIAL CHARACTERS
    #remove '@' mentions
    txt = re.sub(r'@[A-Za-z0-9]+', '', txt)
    #remove hashtag symbols (#)
    txt = re.sub(r'#', '', txt)
    #remove retweet markers (RT)
    txt = re.sub(r'RT[\s]+', '', txt)
    #remove website links
    txt = re.sub(r'https?:\/\/\S+', '', txt)
    #remove any other symbol
    txt = re.sub(r'[^\w]', ' ', txt)
    #remove punctuation
    txt = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', txt)
    #strip 'amp' left over from HTML-escaped ampersands (&amp;)
    txt = re.sub(r'amp', '', txt)
    #LOWER CASE
    txt = txt.lower()
    #REMOVE STOP WORDS
    txt = txt.split()
    txt = [word for word in txt if not word in stopwords.words("english")]
    txt = " ".join(txt)
    #TOKENIZE
    txt = nltk.WordPunctTokenizer().tokenize(txt)
    #LEMMATIZE (requires the 'wordnet' corpus: nltk.download('wordnet'))
    txt = [nltk.WordNetLemmatizer().lemmatize(token) for token in txt]
    return txt
trumpnlp = trumpnlp.copy()
trumpnlp['tweet'] = trumpnlp['tweet'].apply(cleanerNLP)
trumpnlp.head()
| | id | date | tweet | likes | retweets | sentiment |
|---|---|---|---|---|---|---|
| 0 | 1.270000e+18 | 6/1/2020 12:27 | [people, trashed, seattle, year, ago, paying, ... | 140518 | 38591 | -1.0 |
| 1 | 1.270000e+18 | 6/1/2020 12:44 | [sleep, joe, biden, people, radical, left, wor... | 38147 | 10089 | -1.0 |
| 2 | 1.270000e+18 | 6/2/2020 13:24 | [congressman, andy, harris, tremendous, advoca... | 30958 | 7964 | 1.0 |
| 3 | 1.270000e+18 | 6/2/2020 13:25 | [congresswoman, jackie, walorski, incredible, ... | 67527 | 18135 | 1.0 |
| 4 | 1.270000e+18 | 6/2/2020 13:25 | [congressman, jim, bank, bank, fighter, indian... | 32484 | 8395 | 1.0 |
To feed the cleaned tweet data to the logistic regression, we need to convert it into a "bag of words" format, so that the model can treat each word as a feature. The value of each feature is the count of that word in the tweet.
sklearn's CountVectorizer will be used for this. Since the data is already tokenized, I pass a no-op dummy function so CountVectorizer skips its own preprocessing and tokenization.
def dummy(doc):
    return doc
bow = CountVectorizer(tokenizer=dummy, preprocessor=dummy)
matrix = bow.fit_transform(trumpnlp['tweet'])
dfClean = pd.DataFrame(matrix.todense())
dfClean.columns = bow.get_feature_names()
Add the dependent variable to the Logistic Regression dataframe
dfClean["sentiment_score"] = trumpnlp.sentiment.copy()
Refactor the dependent variable for logistic regression (1 = positive sentiment, 0 = negative sentiment)
dfClean.loc[dfClean["sentiment_score"] == -1, ['sentiment_score']] = 0
dfClean = dfClean.dropna()
dfClean.head()
| | 00 | 000 | 0022 | 1 | 10 | 100 | 100th | 10m | 11 | 12 | ... | yesterday | yet | york | young | younger | zero | zip | zone | zurich | sentiment_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 |
5 rows × 2402 columns
len(dfClean[dfClean.sentiment_score == 0])
228
len(dfClean[dfClean.sentiment_score == 1])
210
In our current dataset, we have 228 negative-sentiment tweets and 210 positive-sentiment tweets, so the classes are roughly balanced (about 52/48). The logistic regression classifier must beat this majority-class baseline of roughly 52% to be useful; otherwise it is no better than a coin flip.
60% of the data will be used for training and 40% for testing. The data will be randomly shuffled, and we'll use stratified sampling to ensure a proportional amount of negative and positive sentiment tweets in each split.
X_train, X_test, y_train, y_test = train_test_split(dfClean.iloc[:, :-1], dfClean.iloc[:, -1], stratify = dfClean.iloc[:, -1],test_size=0.40, random_state=42, shuffle = True)
print("Training Features shape: " + str(X_train.shape))
print("Test Features shape: " + str(X_test.shape))
print("Training Outcome shape: " + str(y_train.shape))
print("Test Outcome shape: " + str(y_test.shape))
Training Features shape: (262, 2401) Test Features shape: (176, 2401) Training Outcome shape: (262,) Test Outcome shape: (176,)
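Before fitting the real model, we can make the majority-class baseline from above explicit with sklearn's DummyClassifier (a sanity check, not part of the original pipeline):
from sklearn.dummy import DummyClassifier
# always predicts the most frequent class, so its accuracy equals the majority-class share (~52% here)
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print(baseline.score(X_test, y_test))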
Fit the model
model = LogisticRegression().fit(X_train, y_train)
Confusion matrix
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[1,0])
sns.heatmap(cm, annot=True)
<AxesSubplot:>
model.score(X_test, y_test)
0.5965909090909091
The model predicted Trump's tweet sentiment with 60% accuracy. Obviously, this leaves major room for improvement, but our results are a start.
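A cheap next step is to inspect which words drive the predictions; with a linear model, the coefficients line up one-to-one with the bag-of-words features (a sketch using the bow and model objects above):
# map each coefficient back to its word; large positive values push predictions toward positive sentiment
coefs = pd.Series(model.coef_[0], index=bow.get_feature_names())
print(coefs.nlargest(10))
print(coefs.nsmallest(10))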